{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F0rgZRFZRg2N"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/retrieval_strategies_mongodb_llamaindex_togetherai.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RjQo4p1yRg2N"
      },
      "source": [
        "# Optimizing for relevance using MongoDB, LlamaIndex and Together.ai\n",
        "\n",
        "In this notebook, we will explore and tune different retrieval options in MongoDB's LlamaIndex integration using Together.ai to get the most relevant results."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YVgOw8nigfa2"
      },
      "source": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "maXePwaIlbWq"
      },
      "outputs": [],
      "source": [
        "# Find out where your google colab is running. If possible, create your atlas\n",
        "# cluster in the nearest google cloud region to reduce latency during embedding\n",
        "!curl ipinfo.io"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TScxhzzCoi9q"
      },
      "source": [
        "## Step 1: Install libraries\n",
        "\n",
        "- **pymongo**: Python package to interact with MongoDB databases and collections\n",
        "<p>\n",
        "- **llama-index**: Python package for the LlamaIndex LLM framework\n",
        "<p>\n",
        "- **llama-index-llms-together**: Python package to use TogetherAI models via their LlamaIndex integration\n",
        "<p>\n",
        "- **llama-index-vector-stores-mongodb**: Python package for MongoDB’s LlamaIndex integration"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PqqPt3h_UbeG"
      },
      "outputs": [],
      "source": [
        "!pip install -qU pymongo llama-index llama-index-vector-stores-mongodb together \\\n",
        "llama-index-llms-together llama-index-embeddings-together datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1Qxr_jwvRg2O"
      },
      "source": [
        "## Step 2: Setup prerequisites\n",
        "\n",
        "- **Set the MongoDB connection string**: Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.\n",
        "\n",
        "- **Set the Together.ai API key**: Steps to obtain an API key as [here](https://docs.together.ai/reference/authentication-1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Bs3Safw_Uj00"
      },
      "outputs": [],
      "source": [
        "import getpass\n",
        "import os\n",
        "\n",
        "from pymongo import MongoClient"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "g0GJ9efPUtfA"
      },
      "outputs": [],
      "source": [
        "os.environ[\"TOGETHER_API_KEY\"] = getpass.getpass(\"Enter your Together.AI API key: \")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qa2Bn-N-pp9a"
      },
      "outputs": [],
      "source": [
        "MONGODB_URI = getpass.getpass(\"Enter your MongoDB URI: \")\n",
        "mongodb_client = MongoClient(\n",
        "    MONGODB_URI, appname=\"devrel.showcase.retrieval_strategies_llamaindex\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rn4FIfvSo33q"
      },
      "source": [
        "## Step 3: Load and process the dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KW6nQKo7Rg2P"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "from datasets import load_dataset\n",
        "from llama_index.core import Document"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jMYkRQwiVag2"
      },
      "outputs": [],
      "source": [
        "data = load_dataset(\"MongoDB/embedded_movies\", split=\"train\")\n",
        "data = pd.DataFrame(data)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gX50GVenRg2P"
      },
      "outputs": [],
      "source": [
        "data.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UvakMBCcRg2P"
      },
      "outputs": [],
      "source": [
        "# Fill Nones in the dataframe\n",
        "data = data.fillna({\"genres\": \"[]\", \"languages\": \"[]\", \"cast\": \"[]\", \"imdb\": \"{}\"})"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ia96lO-5Rg2Q"
      },
      "outputs": [],
      "source": [
        "documents = []\n",
        "\n",
        "for _, row in data.iterrows():\n",
        "    # Extract required fields\n",
        "    title = row[\"title\"]\n",
        "    rating = row[\"imdb\"].get(\"rating\", 0)\n",
        "    languages = row[\"languages\"]\n",
        "    cast = row[\"cast\"]\n",
        "    genres = row[\"genres\"]\n",
        "    # Create the metadata attribute\n",
        "    metadata = {\"title\": title, \"rating\": rating, \"languages\": languages}\n",
        "    # Create the text attribute\n",
        "    text = f\"Title: {title}\\nPlot: {row['fullplot']}\\nCast: {', '.join(item for item in cast)}\\nGenres: {', '.join(item for item in  genres)}\\nLanguages: {', '.join(item for item in languages)}\\nRating: {rating}\"\n",
        "    documents.append(Document(text=text, metadata=metadata))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NyRDy3IURg2Q"
      },
      "outputs": [],
      "source": [
        "print(documents[0].text)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JPrAL1AlRg2Q"
      },
      "outputs": [],
      "source": [
        "print(documents[0].metadata)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VhRChUYVRg2Q"
      },
      "outputs": [],
      "source": [
        "from llama_index.core import StorageContext, VectorStoreIndex\n",
        "from llama_index.core.settings import Settings\n",
        "from llama_index.embeddings.together import TogetherEmbedding\n",
        "from llama_index.llms.together import TogetherLLM\n",
        "from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch\n",
        "from pymongo.errors import OperationFailure\n",
        "from pymongo.operations import SearchIndexModel"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Yc1jPxtpip7y"
      },
      "source": [
        "## Step 4: Define the LLM and Embedding Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "r4byJqYaRg2Q"
      },
      "outputs": [],
      "source": [
        "Settings.llm = llm = TogetherLLM(model=\"mistralai/Mistral-7B-v0.1\")\n",
        "\n",
        "Settings.embed_model = embed_model = TogetherEmbedding(\n",
        "    model_name=\"togethercomputer/m2-bert-80M-2k-retrieval\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sFUS0_EURg2Q"
      },
      "source": [
        "## Step 5: Create MongoDB Atlas Vector store"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1ksl0aknRg2Q"
      },
      "outputs": [],
      "source": [
        "VS_INDEX_NAME = \"vector_index\"\n",
        "FTS_INDEX_NAME = \"fts_index\"\n",
        "DB_NAME = \"mdb_llamaindex_together\"\n",
        "COLLECTION_NAME = \"hybrid_search\"\n",
        "collection = mongodb_client[DB_NAME][COLLECTION_NAME]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "334qDh8Fj3-j"
      },
      "source": [
        "### Create Atlas Vector Index"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vGAdaeu0Rg2Q"
      },
      "outputs": [],
      "source": [
        "vector_store = MongoDBAtlasVectorSearch(\n",
        "    mongodb_client,\n",
        "    db_name=DB_NAME,\n",
        "    collection_name=COLLECTION_NAME,\n",
        "    vector_index_name=VS_INDEX_NAME,\n",
        "    fulltext_index_name=FTS_INDEX_NAME,\n",
        "    embedding_key=\"embedding\",\n",
        "    text_key=\"text\",\n",
        ")\n",
        "# If the collection has documents with embeddings already, create the vector store index from the vector store\n",
        "if collection.count_documents({}) > 0:\n",
        "    vector_store_index = VectorStoreIndex.from_vector_store(vector_store)\n",
        "# If the collection does not have documents, embed and ingest them into the vector store\n",
        "else:\n",
        "    vector_store_context = StorageContext.from_defaults(vector_store=vector_store)\n",
        "    vector_store_index = VectorStoreIndex.from_documents(\n",
        "        documents, storage_context=vector_store_context, show_progress=True\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I0vBccQaj-o6"
      },
      "source": [
        "### Create Atlas Search Index"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "T5k_6eZQRg2Q"
      },
      "outputs": [],
      "source": [
        "vs_model = SearchIndexModel(\n",
        "    definition={\n",
        "        \"fields\": [\n",
        "            {\n",
        "                \"type\": \"vector\",\n",
        "                \"path\": \"embedding\",\n",
        "                \"numDimensions\": 768,\n",
        "                \"similarity\": \"cosine\",\n",
        "            },\n",
        "            {\"type\": \"filter\", \"path\": \"metadata.rating\"},\n",
        "            {\"type\": \"filter\", \"path\": \"metadata.languages\"},\n",
        "        ]\n",
        "    },\n",
        "    name=VS_INDEX_NAME,\n",
        "    type=\"vectorSearch\",\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YxQ7MpbrRg2Q"
      },
      "outputs": [],
      "source": [
        "fts_model = SearchIndexModel(\n",
        "    definition={\"mappings\": {\"dynamic\": False, \"fields\": {\"text\": {\"type\": \"string\"}}}},\n",
        "    name=FTS_INDEX_NAME,\n",
        "    type=\"search\",\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XcsshlCURg2Q"
      },
      "outputs": [],
      "source": [
        "for model in [vs_model, fts_model]:\n",
        "    try:\n",
        "        collection.create_search_index(model=model)\n",
        "    except OperationFailure:\n",
        "        print(f\"Duplicate index found for model {model}. Skipping index creation.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ePrjMar0Rg2Q"
      },
      "source": [
        "## Step 6: Get movie recommendations"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dKZaq4d0Rg2Q"
      },
      "outputs": [],
      "source": [
        "def get_recommendations(query: str, mode: str, **kwargs) -> None:\n",
        "    \"\"\"\n",
        "    Get movie recommendations\n",
        "\n",
        "    Args:\n",
        "        query (str): User query\n",
        "        mode (str): Retrieval mode. One of (default, text_search, hybrid)\n",
        "    \"\"\"\n",
        "    query_engine = vector_store_index.as_query_engine(\n",
        "        similarity_top_k=5, vector_store_query_mode=mode, **kwargs\n",
        "    )\n",
        "    response = query_engine.query(query)\n",
        "    nodes = response.source_nodes\n",
        "    for node in nodes:\n",
        "        title = node.metadata[\"title\"]\n",
        "        rating = node.metadata[\"rating\"]\n",
        "        score = node.score\n",
        "        print(f\"Title: {title} | Rating: {rating} | Relevance Score: {score}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Query"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Find top 5 highest rated movies (rating >= 8.0)\n",
        "print(\"\\nTop 5 highest rated movies (rating >= 8.0):\")\n",
        "\n",
        "top_movies = list(\n",
        "    collection.find(\n",
        "        {\"metadata.rating\": {\"$gte\": 8.0}},  # filter\n",
        "        {\"metadata.title\": 1, \"metadata.rating\": 1, \"_id\": 0},  # projection\n",
        "    )\n",
        "    .sort(\"metadata.rating\", -1)  # sort\n",
        "    .limit(5)\n",
        ")  # limit\n",
        "\n",
        "for movie in top_movies:\n",
        "    print(f\"Title: {movie['metadata']['title']}, Rating: {movie['metadata']['rating']}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Aggregate"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Calculate average ratings by language, print top 10\n",
        "pipeline = [\n",
        "    {\"$unwind\": \"$metadata.languages\"},\n",
        "    {\n",
        "        \"$group\": {\n",
        "            \"_id\": \"$metadata.languages\",\n",
        "            \"average_rating\": {\"$avg\": \"$metadata.rating\"},\n",
        "        }\n",
        "    },\n",
        "    {\"$sort\": {\"average_rating\": -1}},\n",
        "    {\"$limit\": 10},\n",
        "]\n",
        "\n",
        "results = collection.aggregate(pipeline)\n",
        "print(\"\\nAverage ratings by language:\")\n",
        "for result in results:\n",
        "    print(f\"Language: {result['_id']}, Average Rating: {result['average_rating']:.2f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NqKdZdnvRg2R"
      },
      "source": [
        "### Full-text search"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zJGawF2TRg2R"
      },
      "outputs": [],
      "source": [
        "get_recommendations(\n",
        "    query=\"Action movies about humans fighting machines\",\n",
        "    mode=\"text_search\",\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "L_oL4Oh1Rg2R"
      },
      "source": [
        "### Vector search"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZXSpdBhLRg2R"
      },
      "outputs": [],
      "source": [
        "get_recommendations(\n",
        "    query=\"Action movies about humans fighting machines\", mode=\"default\"\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iKPcfgRKRg2R"
      },
      "source": [
        "### Hybrid search"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GQEaH48pRg2R"
      },
      "outputs": [],
      "source": [
        "# Vector and full-text search weighted equal by default\n",
        "get_recommendations(query=\"Action movies about humans fighting machines\", mode=\"hybrid\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-4dGkkI5Rg2R"
      },
      "outputs": [],
      "source": [
        "# Higher alpha, vector search dominates\n",
        "get_recommendations(\n",
        "    query=\"Action movies about humans fighting machines\",\n",
        "    mode=\"hybrid\",\n",
        "    alpha=0.7,\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jIhe37T8Rg2R"
      },
      "outputs": [],
      "source": [
        "# Lower alpha, full-text search dominates\n",
        "get_recommendations(\n",
        "    query=\"Action movies about humans fighting machines\",\n",
        "    mode=\"hybrid\",\n",
        "    alpha=0.3,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gNEFfpvBRg2R"
      },
      "source": [
        "### Combining metadata filters with search"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jYvjHrLzRg2R"
      },
      "outputs": [],
      "source": [
        "from llama_index.core.vector_stores import (\n",
        "    FilterCondition,\n",
        "    FilterOperator,\n",
        "    MetadataFilter,\n",
        "    MetadataFilters,\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AEdProsWRg2R"
      },
      "outputs": [],
      "source": [
        "filters = MetadataFilters(\n",
        "    filters=[\n",
        "        MetadataFilter(key=\"metadata.rating\", value=7, operator=FilterOperator.GT),\n",
        "        MetadataFilter(\n",
        "            key=\"metadata.languages\", value=\"English\", operator=FilterOperator.EQ\n",
        "        ),\n",
        "    ],\n",
        "    condition=FilterCondition.AND,\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NUUGIhyNRg2R"
      },
      "outputs": [],
      "source": [
        "get_recommendations(\n",
        "    query=\"Action movies about humans fighting machines\",\n",
        "    mode=\"hybrid\",\n",
        "    alpha=0.7,\n",
        "    filters=filters,\n",
        ")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.1"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "state": {}
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
